REMARKS

Claims 1-13, 15-20, and 22-26 are all the claims pending in the application. The Office
Action indicates that prosecution of the application on the merits is closed in accordance with the
practice under Ex parte Quayle, 1935 C.D. 11, 453 O.G. 213. The Examiner has indicated in the
Office Action of December 16, 2004 that claims 1-13, 15-20, and 22-26 are allowed, and that the
application is in condition for allowance except for formal matters. Applicants acknowledge the
Examiner's gracious allowance of the claims, and are submitting this amendment to address the
following formal matters.
First, the limitations of claim 21, previously cancelled and incorporated into claim 20,
are shown underlined in amended claim 20. In particular, the following language (previously
presented in claim 21, now cancelled) is underlined in claim 20: "wherein said selecting of
feature weights are optimized by an objective function to produce said solution of a final
clustering that simultaneously minimizes average intra-cluster dispersion and maximizes average
inter-cluster dispersion along all said feature spaces."

Second, the Applicants herein submit both a marked-up version and a clean
version (without markings) of the substitute specification, and herein indicate that no new matter
is being added. Rather, the changes made in the substitute specification are being made in order
to comply with the suggestions of the Examiner (in the Office Action of March 25, 2003), and in
order to correct typographical errors found in the original specification, as well as to provide
further clarification to the description of the invention. Applicants had originally submitted the
substitute specification (marked-up version only) in an amendment dated May 13, 2003, which is
prior to the date (July 30, 2003) that the USPTO rules requiring a clean version (without
markings) in addition to a marked-up version of a substitute specification became mandatory
under 37 C.F.R. §1.121. Nonetheless, in an effort to move along the prosecution of the
application, Applicants are herein submitting a clean version (without markings) as well as a
marked-up version of the substitute specification as requested in the Office Action.

In view of the foregoing, the Examiner is respectfully requested to pass the above 
application to issue at the earliest possible time. Should the Examiner find the application to be 
other than in condition for allowance, the Examiner is requested to contact the undersigned 
attorney at the local telephone number listed below to discuss any other changes deemed 
necessary. Please charge any deficiencies and credit any overpayments to Attorney's Deposit
Account Number 09-0*141.
FEATURE WEIGHTING IN K-MEANS CLUSTERING

BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention generally relates to data clustering and, in particular, concerns a
method and system for providing a framework for integrating multiple, heterogeneous feature
spaces in a k-means clustering algorithm.

Description of the Related Art 

Clustering, the grouping together of similar data points in a data set, is a widely used
procedure for analyzing data for data mining applications. Such
applications of clustering include unsupervised classification and taxonomy generation, nearest-
neighbor searching, scientific discovery, vector quantization, text analysis and navigation, data
reduction and summarization, supermarket database analysis, customer/market segmentation,
and time series analysis.

One of the more popular techniques for clustering data of a set of data records includes
partitioning operations (also referred to as finding pattern vectors) of the data
using a k-means clustering algorithm which generates a minimum variance grouping of data by
minimizing the sum of squared Euclidean distances from cluster centroids. The popularity of the
k-means clustering algorithm is based on its ease of interpretation, simplicity of use, scalability,
speed of convergence, parallelizability, adaptability to sparse data, and ease of out-of-core use.

The k-means clustering algorithm functions to reduce data. Initial cluster centers are
chosen arbitrarily. Records from the database are then distributed among the chosen cluster
domains based on minimum distances. After records are distributed, the cluster centers are
updated to reflect the means of all the records in the respective cluster domains. This process is
iterated so long as the cluster centers continue to move, until they converge and remain static.
Performance of this algorithm is influenced by the number and location of the initial cluster
centers, and by the order in which pattern samples are passed through the program.

Initial use of the k-means clustering algorithm typically requires a user or an external
algorithm to define the number of clusters. Second, all the data points within the data set are
loaded into the function. Preferably, the data points are indexed according to a numeric field
value and a record number. Third, a cluster center is initialized for each of the predefined number
of clusters. Each cluster center contains a random normalized value for each field within
the cluster. Thus, initial centers are typically randomly defined. Alternatively, initial cluster
center values may be predetermined based on equal divisions of the range within a field. In a
fourth step, a routine is performed for each of the records in the database. For each record
number from one to the current record number, the cluster center closest to the current record is
determined. The record is then assigned to that closest cluster by adding the record number to the
list of records previously assigned to the cluster. In a fifth step, after all of the records have been
assigned to a cluster, the cluster center for each cluster is adjusted to reflect the averages of data
values contained in the records assigned to the cluster. The steps of assigning records to clusters
and then adjusting the cluster centers are repeated until the cluster centers move less than a
predetermined epsilon value. At this point the cluster centers are viewed as being static.
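
By way of illustration only, the classical procedure described above can be sketched in Python as
follows. This is a minimal sketch and not part of the original disclosure. The names
squared_distance and kmeans, the use of k distinct records as initial centers (instead of random
normalized values), and the parameters epsilon, max_iter and seed are assumptions made for the
example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(records, k, epsilon=1e-9, max_iter=100, seed=0):
    """Classical k-means sketch; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Third step: initialize one center per cluster.
    # Assumption: k distinct records are drawn at random as the initial centers.
    centers = [list(r) for r in rng.sample(records, k)]
    assignments = [0] * len(records)
    for _ in range(max_iter):
        # Fourth step: assign every record to its closest cluster center.
        assignments = [
            min(range(k), key=lambda u: squared_distance(r, centers[u]))
            for r in records
        ]
        # Fifth step: move each center to the average of its assigned records.
        new_centers = []
        for u in range(k):
            members = [r for r, a in zip(records, assignments) if a == u]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[u])  # an empty cluster keeps its center
        # Largest (squared) movement of any center in this pass.
        shift = max(squared_distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:  # the centers are viewed as being static
            break
    return centers, assignments
```

For example, calling kmeans on six two-dimensional records forming two well-separated groups with
k = 2 places the records of each group in the same cluster.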

A fundamental starting point for machine learning, multivariate statistics, or data
mining is where a data record can be represented as a high-dimensional feature
vector. In many traditional applications, all of the features are essentially of the same
type. However, many emerging data sets often have many different feature spaces, for
example:

• Image indexing and searching systems use at least four different types of features: color, 
texture, shape, and location. 

• Hypertext documents contain at least three different types of features: the words,
the out-links, and the in-links.

• XML (www.xml.org) has become a standard way to represent data records; such records
may have a number of different textual, referential, graphical, numerical, and categorical
features.

• The profile of a typical on-line customer, such as an Amazon.com customer, may contain
purchased books, music, DVD/video, software, toys, etc.

The above examples illustrate that
data sets with multiple, heterogeneous features are indeed natural and common. In addition,
many data sets in the University of California Irvine Machine Learning and Knowledge
Discovery and Data Mining repositories contain data records with heterogeneous features. Data
clustering is an unsupervised learning operation whose output provides fundamental techniques
in machine learning and statistics. Statistical and computational issues associated with the k-means
clustering algorithm have been extensively studied for these clustering operations. The same
cannot be said, however, for another key ingredient for multidimensional data analysis:
clustering data records having multiple, heterogeneous feature spaces.
SUMMARY OF THE INVENTION

The invention provides a method and
system for integrating multiple, heterogeneous feature spaces in a k-means clustering algorithm.
The method of the invention adaptively selects the relative weights assigned to the various feature
spaces, which simultaneously attains good separation along all of the feature spaces.

The invention integrates multiple feature spaces in a k-means clustering algorithm by
assigning different relative weights to these various feature spaces. Optimal feature weights that
can be incorporated into this algorithm are also determined; these lead to a clustering that
simultaneously minimizes the average intra-cluster dispersion and maximizes the average
inter-cluster dispersion along all the feature spaces.

DESCRIPTION OF THE DRAWINGS 

The foregoing will be better understood from
the following detailed description of preferred embodiments of the invention with reference to 
the drawings, in which 

FIGs. 1A and 1B show a data computing system and method of the
invention, respectively;

FIGs. 2A, 2B, 2C, and 2D show graphical results of an example using the
invention;

FIG. 3 shows the feasible weights for the second exemplary use of the invention, wherein
the number of feature spaces is 3, as the triangular region formed by the intersection of the plane
$\alpha_1 + \alpha_2 + \alpha_3 = 1$ with the nonnegative orthant of $\mathbb{R}^3$;

FIG. 4 shows, for a newsgroups data set, a plot of macro-p versus the
objective function $Q_1 \times Q_2 \times Q_3$ for various different weight tuples; and

FIG. 5 is a flowchart illustrating a preferred method of the invention. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS OF THE INVENTION 
1. Introduction

While the invention is primarily disclosed as a method, it will be understood by a person 
of ordinary skill in the art that an apparatus, such as a conventional data processor, including a 
CPU, memory, I/O, program storage, a connecting bus, and other appropriate components, could 
be programmed or otherwise designed to facilitate the practice of the method of the invention. 
Such a processor would include appropriate program means for executing the method of the 
invention. Also, an article of manufacture, such as a pre-recorded disk or other similar computer 
program product, for use with a data processing system, could include a storage medium and
program means recorded thereon for directing the data processing system to facilitate the 
practice of the method of the invention. It will be understood that such apparatus and articles of 
manufacture also fall within the spirit and scope of the invention. 

FIG. 1A shows an exemplary data processing system for practicing the disclosed
feature weighted k-means data clustering analysis methodology that includes a computing
device in the form of a conventional computer 20, including one or more processing units 21, a
system memory 22, and a system bus 23 that couples various system components including the
system memory to the processing unit 21. The system bus 23 may be any of several types of bus
structures including a memory bus or memory controller, a peripheral bus, and a local bus using
any of a variety of bus architectures.

The system memory includes read only memory (ROM) 24 and random access memory 
(RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to
transfer information between elements within the computer 20, such as during start-up, is stored 
in ROM 24. 

The computer 20 further includes a hard disk drive 27 for reading from and writing to a 
hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable
magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical 
disk 31 such as a CD-ROM or other optical media. The hard disk drive 27, magnetic disk drive 
28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 
32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives 
and their associated computer-readable media provide nonvolatile storage of computer readable 
instructions, data structures, program modules and other data for the computer 20. Although the 
exemplary environment described herein employs a hard disk, a removable magnetic disk 29,
and a removable optical disk 31, it should be appreciated by those skilled in the art that other
types of computer readable media which can store data that is accessible by a computer, such as
magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access
memories (RAMs), read only memories (ROMs), and the like, may also be used in the
exemplary operating environment. Data and program instructions can be in the storage area that
is readable by a machine, and that tangibly embodies a program of instructions executable by the 
machine for performing the method of the present invention described herein for data mining 
applications. 

A number of program modules may be stored on the hard disk, magnetic disk 29, optical 
disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application 
programs 36, other program modules 37, and program data 38. A user may enter commands and 
information into the computer 20 through input devices such as a keyboard 40 and pointing 
device 42. Other input devices (not shown) may include a microphone, joystick, game pad, 
satellite dish, scanner, or the like. These and other input devices are often connected to the 
processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be 
connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB).
A monitor 47 or other type of display device is also connected to the system bus 23 via an
interface, such as a video adapter 48. In addition to the monitor, personal computers typically
include other peripheral output devices (not shown), such as speakers and printers. The
computer 20 may operate in a networked environment using logical connections to one or more
remote computers, such as a remote computer 49. The remote computer 49 may be another
personal computer, a server, a router, a network PC, a peer device or other common network
node, and typically includes many or all of the elements described above relative to the computer
20, although only a memory storage device 50 has been illustrated in FIG. 1A. The logical
connections depicted in FIG. 1A include a local area network (LAN) 51 and a wide area
network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide
computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 20 is connected to the local 
network 51 through a network interface or adapter 53. When used in a WAN networking
environment, the computer 20 typically includes a modem 54 or other means for establishing 
communications over the wide area network 52, such as the Internet. The modem 54, which may 
be internal or external, is connected to the system bus 23 via the serial port interface 46. In a 
networked environment, program modules depicted relative to the computer 20, or portions 
thereof, may be stored in the remote memory storage device. It will be appreciated that the 
network connections shown are exemplary and other means of establishing a communications 
link between the computers may be used. 

The method of the invention, as shown in general form in FIG. 1B, may be
implemented using standard programming and/or engineering techniques using computer
programming software, firmware, hardware or any combination or sub-combination
thereof. Any such resulting program(s), having computer readable program code
means, may be embodied or provided within one or more computer readable or usable media 
such as fixed (hard) drives, disk, diskettes, optical disks, magnetic tape, semiconductor memories 
such as read-only memory (ROM), etc., or any transmitting/receiving medium such as the 
Internet or other communication network or link, thereby making a computer program product, 
i.e., an article of manufacture, according to the invention. The article of manufacture containing 
the computer programming code may be made and/or used by executing the code directly from
one medium, by copying the code from one medium to another medium, or by transmitting the 
code over a network. 

The computing system for implementing the method of the invention can be in the form
of software, firmware, hardware or any combination or sub-combination thereof,
which embody the invention. One skilled in the art of computer science will easily be able to
combine the software created as described with appropriate general purpose or special purpose
computer hardware to create a computer system and/or computer subcomponents embodying the
invention and to create a computer system and/or computer subcomponents for carrying out the 
method of the invention. 

The method of the invention is for clustering data, by establishing a starting point at step
1, as shown in FIG. 1B, wherein a given data set has m feature spaces, and each data object
(record) is represented as a tuple of m feature vectors. To cluster, a measure of
distortion between two data records is needed. Since different types of features may have
radically different statistical distributions, in general, it is unnatural to disregard fundamental
differences between various different types of features and to impose a uniform, un-weighted
distortion measure across disparate feature spaces. In Section 2 below, a distortion between two
data records is provided as a weighted sum of suitable distortion measures on individual component
feature vectors, where the distortions on individual components are allowed to be different.
In Section 3 below, using a convex programming formulation, the
classical Euclidean k-means algorithm is generalized to use the weighted distortion measure. In
Section 4 below, optimal feature weights are selected that lead to a clustering that simultaneously
minimizes the average intra-cluster dispersion and maximizes the average inter-cluster dispersion
along all the feature spaces. In Section 5, an evaluation strategy is outlined. In
Sections 6 and 7, two exemplary uses of the invention are provided for a) clustering data sets
with numerical and categorical features; and b) clustering text data sets with words, 2-phrases,
and 3-phrases, respectively. Using data sets with a known ground truth classification, it is
empirically demonstrated that the clusterings corresponding to the optimal feature weights
deliver nearly optimal precision/recall performance.

Feature weighting may be thought of as a generalization of feature selection where the
latter restricts attention to weights that are either 1 (retain the feature) or 0 (eliminate the
feature); see Wettschereck et al., Artificial Intelligence Review, in the article entitled "A review
and empirical evaluation of feature weighting methods for a class of lazy learning algorithms,"
Vol. 11, pps. 273-314, 1997. Feature selection in the context of supervised learning has a long
history in machine learning; see, for example, Blum et al., Artificial
Intelligence, "Selection of relevant features and examples in machine learning," Vol. 97, pps.
245-271, 1997. Feature selection in the context of unsupervised learning has only recently been
systematically studied.
2. Data Model and a Distortion Measure

2.1 Data Model: Assume that a set of data records is given, where each object is a tuple of m
component feature vectors. A typical data object is written as $x = (F_1, F_2, \ldots, F_m)$, where
the $l$-th component feature vector $F_l$, $1 \le l \le m$, is to be thought of as a column vector and
lies in some (abstract) feature space $\mathcal{F}_l$. The data object $x$ lies in the $m$-fold product
feature space $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_m$.
The feature spaces $\{\mathcal{F}_l\}_{l=1}^{m}$ can be of different dimensions and
possess different topologies; hence, the data model accommodates heterogeneous types of
features. Two examples of feature spaces are:

Euclidean Case: $\mathcal{F}_l$ is either $\mathbb{R}^{f_l}$, $f_l \ge 1$, or some compact submanifold thereof.

Spherical Case: $\mathcal{F}_l$ is the intersection of the $f_l$-dimensional, $f_l \ge 1$, unit sphere with the
non-negative orthant of $\mathbb{R}^{f_l}$.

2.2 A Weighted Distortion Measure: Distortion is measured between two given data records
$x = (F_1, F_2, \ldots, F_m)$ and $\tilde{x} = (\tilde{F}_1, \tilde{F}_2, \ldots, \tilde{F}_m)$. For
$1 \le l \le m$, let $D_l$ denote a distortion measure between the corresponding component feature
vectors $F_l$ and $\tilde{F}_l$. Mathematically, only two properties of the distortion function are
needed:

• $D_l : \mathcal{F}_l \times \mathcal{F}_l \to [0, \infty)$.

• For a fixed $F_l$, $D_l$ is convex in $\tilde{F}_l$.
Euclidean Case: The squared-Euclidean distance

$$D_l(F_l, \tilde{F}_l) = (F_l - \tilde{F}_l)^T (F_l - \tilde{F}_l)$$

trivially satisfies the non-negativity and, for $\lambda \in [0,1]$, the convexity follows from:

$$D_l\left(F_l, \lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''\right) \le \lambda D_l(F_l, \tilde{F}_l') + (1-\lambda) D_l(F_l, \tilde{F}_l'').$$

Spherical Case: The cosine distance $D_l(F_l, \tilde{F}_l) = 1 - F_l^T \tilde{F}_l$ trivially satisfies the
non-negativity and, for $\lambda \in [0,1]$, the convexity follows from:

$$D_l\left(F_l, \frac{\lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''}{\|\lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''\|}\right) \le \lambda D_l(F_l, \tilde{F}_l') + (1-\lambda) D_l(F_l, \tilde{F}_l''),$$

where $\|\cdot\|$ denotes the Euclidean norm. The division by $\|\lambda \tilde{F}_l' + (1-\lambda)\tilde{F}_l''\|$ ensures that
the second argument of $D_l$ is a unit vector. Geometrically, the convexity is defined along the
geodesic arc connecting the two unit vectors $\tilde{F}_l'$ and $\tilde{F}_l''$ and not along the chord
connecting the two.

Given $m$ valid distortion measures $\{D_l\}_{l=1}^{m}$ between the corresponding $m$
component feature vectors of $x$ and $\tilde{x}$, a weighted distortion measure between $x$ and
$\tilde{x}$ is defined as:

$$D^{\alpha}(x, \tilde{x}) = \sum_{l=1}^{m} \alpha_l D_l(F_l, \tilde{F}_l),$$

where the feature weights $\{\alpha_l\}_{l=1}^{m}$ are non-negative and sum to 1, and $\alpha = (\alpha_1, \ldots, \alpha_m)$.
The weighted distortion $D^{\alpha}$ is a convex combination of convex distortion measures, and hence,
for a fixed $x$, $D^{\alpha}$ is convex in $\tilde{x}$. The feature weights $\{\alpha_l\}_{l=1}^{m}$ are enabled in the method,
and are used to assign different relative importance to component feature vectors. In Section 4
below, appropriate choice of these parameters is made.
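
As a hedged illustration only, the weighted distortion of this section can be sketched in Python as
follows. The function names squared_euclidean, cosine_distance and weighted_distortion are
assumptions made for the example, and a record is assumed to be a tuple of m feature vectors, one
per feature space.

```python
def squared_euclidean(f, g):
    """Euclidean case: squared distance between two component feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f, g))


def cosine_distance(f, g):
    """Spherical case: 1 minus the inner product; f and g are assumed unit vectors."""
    return 1.0 - sum(a * b for a, b in zip(f, g))


def weighted_distortion(x, y, distortions, alpha):
    """Weighted sum over the m feature spaces of the per-component distortions.

    x, y        : records, each a tuple of m component feature vectors
    distortions : one distortion function per feature space
    alpha       : non-negative feature weights that sum to 1
    """
    if any(a < 0 for a in alpha) or abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("feature weights must be non-negative and sum to 1")
    return sum(a * d(f, g) for a, d, f, g in zip(alpha, distortions, x, y))
```

For a record with one Euclidean component and one spherical component, for instance,
weighted_distortion(x, y, [squared_euclidean, cosine_distance], (0.25, 0.75)) weights the
numerical distance by 0.25 and the cosine distance by 0.75.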

3. k-Means with Weighted Distortion

3.1 The Problem: Suppose that $n$ data records are given such that

$$x_i = (F_{(i,1)}, F_{(i,2)}, \ldots, F_{(i,m)}), \quad 1 \le i \le n,$$

where the $l$-th, $1 \le l \le m$, component feature vector of every data record is in the feature space
$\mathcal{F}_l$. A partitioning of the data set $\{x_i\}_{i=1}^{n}$ into $k$ disjoint clusters $\{\pi_u\}_{u=1}^{k}$ is sought.

3.2 Generalized Centroids: Given a partitioning $\{\pi_u\}_{u=1}^{k}$, for each partition $\pi_u$,
write the corresponding generalized centroid as

$$c_u = (c_{(u,1)}, c_{(u,2)}, \ldots, c_{(u,m)}),$$

where, for $1 \le l \le m$, the $l$-th component $c_{(u,l)}$ is in $\mathcal{F}_l$. The centroid $c_u$ is defined as the
solution of the following convex programming problem:

$$c_u = \arg\min_{\tilde{x} \in \mathcal{F}} \left( \sum_{x \in \pi_u} D^{\alpha}(x, \tilde{x}) \right). \qquad (1)$$



In an empirical average sense, the generalized centroid may be thought of as being the closest in
$D^{\alpha}$ to all the data records in the cluster $\pi_u$. The key to solving (1) is to observe that $D^{\alpha}$ is
component-wise convex, and, hence, equation (1) can be solved by separately solving for each of
its $m$ components $c_{(u,l)}$, $1 \le l \le m$. In other words, the following $m$ convex programming
problems are solved:

$$c_{(u,l)} = \arg\min_{\tilde{F}_l \in \mathcal{F}_l} \left( \sum_{x \in \pi_u} D_l(F_l, \tilde{F}_l) \right). \qquad (2)$$



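For the two feature spaces of interest (others as well), the solution of equation (2) can be written
in closed form for the Euclidean and Spherical cases, respectively:

$$c_{(u,l)} = \frac{1}{\sum_{x \in \pi_u} 1} \sum_{x \in \pi_u} F_l \quad \text{(Euclidean case)}, \qquad c_{(u,l)} = \frac{\sum_{x \in \pi_u} F_l}{\left\| \sum_{x \in \pi_u} F_l \right\|} \quad \text{(Spherical case)},$$

where $x = (F_1, F_2, \ldots, F_m)$.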
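
The closed-form solutions above can be illustrated, in a non-limiting way, with the following
Python sketch. The names euclidean_centroid, spherical_centroid and generalized_centroid are
assumptions made for the example; a cluster is assumed to be a non-empty list of records, each
record being a tuple of m component feature vectors.

```python
import math


def euclidean_centroid(vectors):
    """Euclidean case: component-wise mean of the feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def spherical_centroid(vectors):
    """Spherical case: sum of the feature vectors rescaled to unit length.

    The vectors are assumed to lie in the non-negative orthant, so the sum
    of a non-empty set of unit vectors cannot be the zero vector.
    """
    total = [sum(col) for col in zip(*vectors)]
    norm = math.sqrt(sum(c * c for c in total))
    return [c / norm for c in total]


def generalized_centroid(cluster, component_centroids):
    """Solve the m component problems separately and return the centroid tuple.

    component_centroids holds one closed-form solver per feature space,
    for example [euclidean_centroid, spherical_centroid].
    """
    return tuple(
        solver([record[l] for record in cluster])
        for l, solver in enumerate(component_centroids)
    )
```

Because the weighted distortion is a sum over components, each component of the centroid is
computed independently of the feature weights.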
3.3 The Method: Referring to FIG. 1B, the method of the invention uses the
formulation of equation (1) in the steps below, wherein the distortion of each
individual cluster $\pi_u$, $1 \le u \le k$, is measured as:

$$\sum_{x \in \pi_u} D^{\alpha}(x, c_u),$$

and the quality of the entire partitioning $\{\pi_u\}_{u=1}^{k}$ as the combined distortion of all the $k$
clusters:

$$\sum_{u=1}^{k} \sum_{x \in \pi_u} D^{\alpha}(x, c_u).$$

What is sought is $k$ disjoint clusters such that equation (3) as
follows is minimized, wherein these $k$ disjoint clusters are:

$$\{\pi_u^{\dagger}\}_{u=1}^{k} = \arg\min_{\{\pi_u\}_{u=1}^{k}} \left( \sum_{u=1}^{k} \sum_{x \in \pi_u} D^{\alpha}(x, c_u) \right), \qquad (3)$$

where the feature weights $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ are fixed. When only one of the weights
$\{\alpha_l\}_{l=1}^{m}$ is nonzero, the minimization problem (3) is known to be NP-complete, meaning no
known algorithm exists for solving the problem in polynomial time. k-means is used, which is
an efficient and effective data clustering algorithm. Moreover, k-means can be thought of as a
gradient descent method, and, hence, never increases the objective
function and eventually converges to a local minimum.
In FIG. 1B, an overview of the processing components is shown for clustering data.
These processing components perform data clustering on data records stored on a storage
medium such as the computer system's hard disk drive 27. The data records typically are made
up of a number of data fields or attributes. Examples of such data records are discussed below
in the two examples of implementing the invention.

The components that perform the clustering require three inputs: the number of clusters
K, a set of K initial starting points, and the data records to be clustered. The clustering of data by
these components produces a final solution as shown in step 5 as an output. Each of the K
clusters of this final solution is represented by its mean (centroid) where each mean has d
components equal to the number of attributes of the data records and a fixed feature weight of
the m feature spaces.

A refinement of feature weights in step 4 below produces better clustering from
the data records to be clustered using the methodology of the invention. A most favorable refined
starting point, produced using a good approximation of an initial starting point, is
discussed below; it moves the set of starting points closer to the modes of the
data distribution.

At Step 1: An initial point with an arbitrary partitioning of the data records to be evaluated
is provided, say $\{\pi_u^{(0)}\}_{u=1}^{k}$. Let $\{c_u^{(0)}\}_{u=1}^{k}$ denote the
generalized centroids associated with the given partitioning. Set the index of iteration $t = 0$. The
choice of the initial partitioning is quite crucial to finding a good local minimum; a
technique for achieving this is taught in U.S. Patent 6,115,708, hereby
incorporated by reference.
At Step 2: For each data record $x_i$, $1 \le i \le n$, find the generalized centroid that is closest
to $x_i$. Now, for $1 \le u \le k$, compute the new partitioning $\{\pi_u^{(t+1)}\}_{u=1}^{k}$ induced by the old
generalized centroids $\{c_u^{(t)}\}_{u=1}^{k}$:

$$\pi_u^{(t+1)} = \left\{ x \in \{x_i\}_{i=1}^{n} : D^{\alpha}(x, c_u^{(t)}) \le D^{\alpha}(x, c_v^{(t)}),\ 1 \le v \le k \right\}. \qquad (4)$$

In words, $\pi_u^{(t+1)}$ is the set of all data records that are closest to the generalized centroid $c_u^{(t)}$.
If some data object is simultaneously closest to more than one generalized centroid, then it is
randomly assigned to one of the clusters. Clusters defined using equation (4) are known as
Voronoi or Dirichlet partitions.

Step 3: Compute the new generalized centroids $\{c_u^{(t+1)}\}_{u=1}^{k}$
corresponding to the partitioning computed in equation (4) by using equations (1)-(2) where,
instead of $\pi_u$, the following is used: $\pi_u^{(t+1)}$.

Step 4: Optionally refine the feature weights and store in memory using the method
discussed in Section 4 below.

Step 5: If some stopping criterion is met, then set $\pi_u^{\dagger} = \pi_u^{(t+1)}$ and set
$c_u^{\dagger} = c_u^{(t+1)}$ for $1 \le u \le k$, and exit at step 5 as the final clustering solution. Otherwise,
increment $t$ by 1, and go to step 2 above and repeat the process. An example of a stopping
criterion is: Stop if the change in the objective function as defined in equation (6)
below, between two successive iterations, is less than a specified threshold, for example when the
generalized centroids do not move and the change is less than a small floating point number.
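
For illustration only, Steps 1 through 5 can be sketched in Python as follows, with the feature
weights held fixed (the optional Step 4 is skipped). This is a minimal sketch under stated
assumptions and not the implementation of the invention: the function names, the parameters tol
and max_iter, the use of the combined distortion of all clusters as the quantity monitored by the
stopping criterion, and the breaking of ties by lowest cluster index (rather than randomly) are all
choices made for the example.

```python
import math


def sq_euclid(f, g):
    return sum((a - b) ** 2 for a, b in zip(f, g))


def cosine(f, g):
    return 1.0 - sum(a * b for a, b in zip(f, g))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def unit_sum(vectors):
    total = [sum(col) for col in zip(*vectors)]
    norm = math.sqrt(sum(c * c for c in total))
    return [c / norm for c in total]


def weighted_kmeans(records, labels, k, alpha, dists, solvers, tol=1e-9, max_iter=100):
    """records: tuples of m feature vectors; labels: initial partitioning (Step 1)."""
    if set(labels) != set(range(k)):
        raise ValueError("the initial partitioning must use every cluster 0..k-1")

    def d_alpha(x, c):
        # Weighted distortion between a record and a generalized centroid.
        return sum(a * d(f, g) for a, d, f, g in zip(alpha, dists, x, c))

    def centroids(current_labels, previous):
        out = []
        for u in range(k):
            members = [x for x, lab in zip(records, current_labels) if lab == u]
            if members:
                out.append(tuple(s([x[l] for x in members]) for l, s in enumerate(solvers)))
            else:
                out.append(previous[u])  # an empty cluster keeps its old centroid
        return out

    def total_distortion(current_labels, cents):
        return sum(d_alpha(x, cents[lab]) for x, lab in zip(records, current_labels))

    labels = list(labels)
    cents = centroids(labels, None)                      # Step 1
    objective = total_distortion(labels, cents)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda u: d_alpha(x, cents[u]))
                  for x in records]                      # Step 2
        cents = centroids(labels, cents)                 # Step 3
        new_objective = total_distortion(labels, cents)
        converged = objective - new_objective < tol      # Step 5
        objective = new_objective
        if converged:
            break
    return labels, cents, objective
```

Calling weighted_kmeans with dists = [sq_euclid, cosine] and solvers = [mean, unit_sum] clusters
records that have one Euclidean and one spherical component.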



4. Choice of the Feature Weights: Throughout this section, fix the number of clusters $k \ge 2$
and fix the initial partitioning used by the k-means algorithm in Step 1 above. Let:

$$\Delta = \left\{ \alpha : \sum_{l=1}^{m} \alpha_l = 1,\ \alpha_l \ge 0,\ 1 \le l \le m \right\}$$

denote the set of all possible feature weights. Given a feature weight $\alpha \in \Delta$, let

$$\Pi(\alpha) = \{\pi_u^{\dagger}(\alpha)\}_{u=1}^{k}$$

denote the partitioning obtained by running the k-means algorithm with the fixed initial
partitioning and the given feature weight. In principle, the k-means algorithm is run for every
possible feature weight in $\Delta$. From the set of all possible clusterings $\{\Pi(\alpha) : \alpha \in \Delta\}$,
select a clustering that is in some sense the best, by introducing a figure-of-merit to compare
various clusterings. Fix a partitioning $\Pi(\alpha)$. By focusing on how well this partitioning
clusters along the $l$-th, $1 \le l \le m$, component feature vector, the average within-cluster
distortion and the average between-cluster distortion along the $l$-th component
feature vector can be defined, respectively, as:

$$\Gamma_l(\alpha) = \sum_{u=1}^{k} \sum_{x \in \pi_u^{\dagger}} D_l\left(F_l, c_{(u,l)}^{\dagger}\right),$$

$$\Lambda_l(\alpha) = \sum_{u=1}^{k} \sum_{x \in \pi_u^{\dagger}} \sum_{v=1, v \ne u}^{k} D_l\left(F_l, c_{(v,l)}^{\dagger}\right),$$

where $x = (F_1, F_2, \ldots, F_m)$. It is desirable to minimize $\Gamma_l(\alpha)$ and to maximize $\Lambda_l(\alpha)$ (i.e.,
coherent clusters that are well-separated from each other are desirable). Hence, minimize:

$$Q_l(\alpha) = \left( \frac{\Gamma_l(\alpha)}{\Lambda_l(\alpha)} \right)^{n_l / n}, \qquad (5)$$



where $n_l$ denotes the number of data records that have a non-zero feature vector along the
$l$-th component. The quantity $n_l$ is introduced to accommodate sparse data sets. If the $l$-th
feature space is simply $\mathbb{R}^{f_l}$ and $D_l$ is the squared-Euclidean distance, then
$\Gamma_l(\alpha)$ is simply the trace of the within-class covariance matrix and
$\Lambda_l(\alpha)$ is the trace of the between-class covariance matrix. In this case,
$Q_l(\alpha)$ is the ratio used in determining the quality of a given classification, and serves
as the objective function underlying multiple discriminant analysis. Minimizing $Q_l(\alpha)$
leads to a good discrimination along the $l$-th component feature space. Since it is desirable to
simultaneously attain good discrimination along all the $m$ feature spaces, the optimal feature
weights $\alpha^{\dagger}$ are selected as:



$$\alpha^{\dagger} = \arg\min_{\alpha \in \Delta} \prod_{l=1}^{m} Q_l(\alpha) \qquad (6)$$

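As a hedged illustration, the figure-of-merit minimized in equation (6) might be evaluated for one clustering as sketched below. The sketch assumes the squared-Euclidean distance on every component, takes the within-cluster distortion as the summed distortion of each record to its own centroid, takes the between-cluster distortion as the summed distortion of each record to all other centroids, and raises their ratio to the power n/n_l as in equation (5). The function names are hypothetical, and degenerate cases (a zero between-cluster distortion, or a component that is zero for every record) are not handled.

```python
def sq_dist(a, b):
    """Squared-Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quality(data, labels, centroids, l):
    """Assumed form of Q_l: (within / between) ** (n / n_l) along component l."""
    k = len(centroids)
    # Within-cluster distortion: each record against its own centroid.
    gamma = sum(sq_dist(x[l], centroids[u][l]) for x, u in zip(data, labels))
    # Between-cluster distortion: each record against every other centroid.
    lam = sum(sq_dist(x[l], centroids[v][l])
              for x, u in zip(data, labels)
              for v in range(k) if v != u)
    n = len(data)
    # n_l counts records whose l-th component vector is not all zeros.
    n_l = sum(1 for x in data if any(x[l]))
    return (gamma / lam) ** (n / n_l)


def objective(data, labels, centroids):
    """Product of Q_l over all m component feature spaces."""
    product = 1.0
    for l in range(len(data[0])):
        product *= quality(data, labels, centroids, l)
    return product
```

Selecting the optimal weights then amounts to evaluating `objective` on the clustering produced for each candidate weight tuple and keeping the tuple with the smallest value.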
5. Evaluating Method Effectiveness: Assume that the optimal weight tuple $\alpha^{\dagger}$ has
been selected by minimizing the objective function in (6). How good is the clustering
corresponding to the optimal feature weight tuple? To answer this, assume that pre-classified
data is given and benchmark the precision/recall performance of various clusterings against the
given ground truth. Precision/recall numbers measure the overlap between a given clustering and
the ground truth classification. These precision/recall numbers are not used in the selection of
the optimal weight tuple, and are intended only to provide a way of evaluating the utility of
feature weighting. By using the precision/recall numbers to compare only partitionings with a
fixed number of clusters k, that is, partitionings with the same model complexity, a measure of
effectiveness is provided.

To meaningfully define precision/recall, the clusterings are converted into classifications
using the following simple rule: identify each cluster with the class that has the largest
overlap with the cluster, and assign every element in that cluster to the found class. The rule
allows multiple clusters to be assigned to a single class, but never assigns a single cluster to
multiple classes. Suppose there are $c$ classes $\{\omega_t\}_{t=1}^{c}$ in the ground truth
classification. For a given clustering, by using the above rule, let $a_t$ denote the number of
data records that are correctly assigned to the class $\omega_t$, let $b_t$ denote the data
records that are incorrectly assigned to the class $\omega_t$, and let $d_t$ denote the data
records that are incorrectly rejected from the class $\omega_t$. Precision and recall are
defined as:

$$p_t = \frac{a_t}{a_t + b_t} \quad \text{and} \quad r_t = \frac{a_t}{a_t + d_t}, \qquad 1 \le t \le c.$$

The precision and recall are defined per class. Next, the performance averages across classes
using macro-precision (macro-p), macro-recall (macro-r), micro-precision (micro-p), and
micro-recall (micro-r) are captured by:

$$\text{macro-p} = \frac{1}{c} \sum_{t=1}^{c} p_t \quad \text{and} \quad \text{macro-r} = \frac{1}{c} \sum_{t=1}^{c} r_t$$

$$\text{micro-p} \overset{(a)}{=} \text{micro-r} = \frac{1}{n} \sum_{t=1}^{c} a_t$$

where (a) follows since, in this case, $\sum_{t=1}^{c} (a_t + b_t) = \sum_{t=1}^{c} (a_t + d_t) = n$.
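The cluster-to-class rule and the averaged measures above might be computed as in the following sketch, which is offered only as an illustration. The function name and the returned dictionary keys are hypothetical; ties in the largest-overlap rule are broken arbitrarily, and a class to which no cluster is assigned is given a precision of zero, both of which are assumptions not stated in the specification.

```python
from collections import Counter


def precision_recall(cluster_ids, true_classes):
    """Map each cluster to its largest-overlap class, then average per-class scores."""
    # Count, for every cluster, how many of its records fall in each class.
    overlap = {}
    for cluster, cls in zip(cluster_ids, true_classes):
        overlap.setdefault(cluster, Counter())[cls] += 1
    # Identify each cluster with the class of largest overlap (ties arbitrary).
    cluster_to_class = {cl: cnt.most_common(1)[0][0] for cl, cnt in overlap.items()}
    predicted = [cluster_to_class[cl] for cl in cluster_ids]

    classes = sorted(set(true_classes))
    a, b, d = Counter(), Counter(), Counter()
    for pred, truth in zip(predicted, true_classes):
        if pred == truth:
            a[truth] += 1      # correctly assigned to the class
        else:
            b[pred] += 1       # incorrectly assigned to class `pred`
            d[truth] += 1      # incorrectly rejected from class `truth`

    # Assumed convention: an empty denominator yields a score of 0.0.
    p = {t: (a[t] / (a[t] + b[t]) if a[t] + b[t] else 0.0) for t in classes}
    r = {t: (a[t] / (a[t] + d[t]) if a[t] + d[t] else 0.0) for t in classes}
    c, n = len(classes), len(true_classes)
    return {
        "macro_p": sum(p.values()) / c,
        "macro_r": sum(r.values()) / c,
        "micro": sum(a.values()) / n,   # micro-p and micro-r coincide
    }
```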



6. Examples of Use of the Method of Clustering Data Sets with Numerical and
Categorical Attributes

6.1 Data Model: Suppose a data set with both numerical and categorical features is given.
Each numerical feature is linearly scaled, that is, the mean is subtracted and the result is
divided by the square-root of the variance. All linearly scaled numerical features are clubbed
into one feature space, and, for this feature vector, the squared-Euclidean distance is used.
Each q-ary categorical feature is represented using a 1-of-q representation, and all of the
categorical features are clubbed into a single feature space. Assuming no missing values, all the
categorical feature vectors have the same norm. Only the direction of the categorical feature
vectors is retained, that is, each categorical feature vector is normalized to have a unit
Euclidean norm, and the cosine distance is used. Essentially, each data object $x$ is represented
as an $m$-tuple, $m = 2$, of feature vectors $(F_1, F_2)$.
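The data model of Section 6.1 might be realized as sketched below, purely by way of example. The sketch assumes the population variance (division by the number of records), assumes that no numerical column is constant, and derives the category levels from the data itself; all function names are hypothetical.

```python
import math


def linearly_scale(columns):
    """Subtract the mean and divide by the square root of the variance, per column."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)  # population variance
        scaled.append([(v - mean) / std for v in col])
    return scaled


def one_of_q(value, levels):
    """1-of-q representation of a categorical value over its q levels."""
    return [1.0 if value == level else 0.0 for level in levels]


def unit_norm(vec):
    """Keep only the direction of a vector by scaling it to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def build_records(numeric_rows, categorical_rows):
    """Return one (F_1, F_2) pair per record: scaled numerics, normalized categoricals."""
    num_cols = linearly_scale(list(zip(*numeric_rows)))
    cat_levels = [sorted(set(col)) for col in zip(*categorical_rows)]
    records = []
    for i, cat_row in enumerate(categorical_rows):
        f1 = [col[i] for col in num_cols]
        f2 = []
        for value, levels in zip(cat_row, cat_levels):
            f2.extend(one_of_q(value, levels))
        records.append((f1, unit_norm(f2)))
    return records
```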

6.2 HEART and ADULT Data Sets: FIGs. 2A, 2B, 2C and 2D show graphs of a first example using
the method of the invention with the HEART (resp. ADULT) data, further defined below. FIGs. 2A
and 2B, respectively, each show a plot of the objective function $Q_1 \times Q_2$ in equation (6)
versus the weight $\alpha_1$. The vertical lines in FIGs. 2A, 2B, 2C and 2D indicate the position
of the optimal weight tuples. For the HEART (resp. ADULT) data, FIGs. 2C and 2D, respectively,
each show a plot of macro-p (resp. micro-p, macro-p, and macro-r) versus the weight $\alpha_1$.
For the HEART data, macro-p, macro-r, and micro-p numbers are very close to each other; thus, to
avoid visual clutter, only the macro-p numbers are plotted. For the ADULT data, the top, the
middle, and bottom plots in FIG. 2D are micro-p, macro-p, and macro-r.

The HEART data set consists of n = 270 instances, and can be obtained from the
STATLOG repository, http://www.ncc.up.pt/liacc/ML/statlog/. Every instance consists of 7
numerical and 6 categorical features. The data set has two classes: absence and presence of heart
disease; 55.56% (resp. 44.44%) instances were in the former (resp. latter) class. The ADULT
data set consists of n = 32561 instances that were extracted from the 1994 Census database.
Every instance consists of 6 numerical and 8 categorical features. The data set has two classes:
those with income less than or equal to $50,000 and those with income more than $50,000;
75.22% (resp. 24.78%) instances were in the former (resp. latter) class.

6.3 The Optimal Weights: In this case, the set of feasible weights is

$$\Delta = \{ (\alpha_1, \alpha_2) : \alpha_1 + \alpha_2 = 1,\ \alpha_1, \alpha_2 \ge 0 \}.$$

The number of clusters k = 8 is selected, and a binary search on the weight
$\alpha_1 \in [0, 1]$ is done to minimize the objective function in equation (6).

For the HEART (resp. ADULT) data, FIGs. 2A and 2B, respectively, show a plot of the objective
function $Q_1 \times Q_2$ in (6) versus the weight $\alpha_1$. For the HEART and ADULT data sets,
the objective function is minimized by the weights (0.12, 0.88) and (0.11, 0.89), respectively.

For the HEART (resp. ADULT) data, FIGs. 2C and 2D, respectively, show a plot of macro-p (resp.
micro-p, macro-p, and macro-r) versus the weight $\alpha_1$. By comparing FIG. 2A with FIG. 2C
and FIG. 2B with FIG. 2D, it can be seen that, roughly, macro-p (resp. micro-p, macro-p, and
macro-r) are negatively correlated with the objective function $Q_1 \times Q_2$ and that, in
fact, the optimal weight tuples achieve nearly optimal precision and recall. In conclusion,
optimizing the objective function $Q_1 \times Q_2$ leads, reassuringly, to optimizing the
precision/recall performance, and thus leads to good clusterings and a final solution.
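For illustration only, a search over the weight along the first feature space might be sketched as below. Note that this sketch substitutes a plain scan over an evenly spaced grid for the binary search mentioned in Section 6.3, since a grid scan does not require the objective to have a single minimum. The callable `objective_of_alpha` is hypothetical: in practice it would run the k-means algorithm with the fixed initial partitioning for the given weight tuple and return the value of the objective function in equation (6).

```python
def scan_alpha1(objective_of_alpha, steps=100):
    """Return the grid point (alpha_1, alpha_2) with the smallest objective value.

    Assumed sketch: evaluates steps + 1 evenly spaced values of alpha_1 in [0, 1].
    """
    best_alpha, best_value = None, float("inf")
    for i in range(steps + 1):
        a1 = i / steps
        alpha = (a1, 1.0 - a1)   # stays on the line alpha_1 + alpha_2 = 1
        value = objective_of_alpha(alpha)
        if value < best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, best_value
```

A finer grid (a larger `steps`) trades more k-means runs for a more precise location of the minimum.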

7. Second Example of Using the Method of Clustering Text Data using
Words and Phrases as the Feature Spaces

7.1 Phrases in Information Retrieval: Vector space models represent each document
as a vector of certain (possibly weighted and normalized) term frequencies. Typically, terms are
single words. However, to capture word ordering, it is intuitive to also include multi-word
sequences, namely, phrases, as terms. The use of phrases as terms in vector space models has
been well studied. In the example that follows, the phrases are not clubbed along with the words
in a single vector space model. For information retrieval, when single words are also
simultaneously used, it is known that natural language phrases do not perform significantly
better than statistical phrases. Hence, the focus is on statistical phrases, which are simpler to
extract; see, for example, Agrawal et al., "Mining sequential patterns," Proc. Int. Conf. Data
Eng., (1995). Also, Mladenic et al., "Word sequences as features in text-learning," in Proc. 7th
Electrotech Computer Science Conference, Ljubljana, Slovenia, pages 145-148, (1998), found that
while adding 2- and 3-word phrases improved the classifier performance, longer phrases did not.
Hence, the example illustrates single words, 2-word phrases and 3-word phrases.

7.2 Data Model: FIG. 3 shows the feasible weights for the second exemplary use of the
invention wherein, when m = 3, $\Delta$ is the triangular region formed by the intersection of
the plane $\alpha_1 + \alpha_2 + \alpha_3 = 1$ with the nonnegative orthant of $\mathbb{R}^3$.
The left vertex, the right vertex, and the top vertex of the triangle correspond to the points
(1, 0, 0), (0, 1, 0), and (0, 0, 1), respectively. Each document is represented as a triplet of
three vectors: a vector of word frequencies, a vector of 2-word phrase frequencies, and a vector
of 3-word phrase frequencies; that is, $x = (F_1, F_2, F_3)$. It is now shown how to compute such
representations for every document in a given corpus. The creation of the first feature vector is
a standard exercise in information retrieval. The basic idea is to construct a word dictionary of
all the words that appear in any of the documents in the corpus, and to prune or eliminate stop
words from this dictionary. For the present application, also eliminated are those low-frequency
words which appeared in less than 0.64% of the documents. Suppose $f_1$ unique words remain in
the dictionary after such elimination. Assign a unique identifier from 1 to $f_1$ to each of
these words. Now, for each document $x$ in the corpus, the first vector $F_1$ in the triplet will
be an $f_1$-dimensional vector. The $j$-th column entry, $1 \le j \le f_1$, of $F_1$ is the
number of occurrences of the $j$-th word in the document $x$. Creation of the second (resp.
third) feature vector is essentially the same as the first, except that the low-frequency 2-word
(resp. 3-word) phrase elimination threshold is set to one-half (resp. one-quarter) of that for
the words, because a 2-word (resp. 3-word) phrase is generally less likely than a single word.
Let $f_2$ (resp. $f_3$) denote the dimensionalities of the second (resp. third) feature vector.
Finally, each of the three components $F_1$, $F_2$, $F_3$ is normalized to have a unit Euclidean
norm, that is, their directions are retained and their lengths are discarded. There are a large
number of term-weighting schemes in information retrieval for assigning different relative
importance to various terms in the feature vectors. These feature vectors correspond to a popular
scheme known as normalized term frequency. The distortion measures $D_1$, $D_2$, and $D_3$ are
the cosine distances.
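The construction of the triplet of feature vectors described in Section 7.2 might be sketched as follows. This is an assumed simplification: phrases are taken to be contiguous word sequences after whitespace tokenization, rather than the statistical phrases of the cited work; dictionary identifiers run from 0 rather than from 1; and thresholds are expressed as a minimum number of documents. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-word sequences (an assumed stand-in for statistical phrases)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(docs_terms, min_docs, stop_terms=()):
    """Keep terms that occur in at least min_docs documents and are not stop terms."""
    doc_freq = Counter()
    for terms in docs_terms:
        doc_freq.update(set(terms))
    kept = sorted(t for t, c in doc_freq.items()
                  if c >= min_docs and t not in stop_terms)
    return {term: j for j, term in enumerate(kept)}


def frequency_vector(terms, dictionary):
    """Term counts over the dictionary, scaled to unit Euclidean norm when non-zero."""
    vec = [0.0] * len(dictionary)
    for term in terms:
        j = dictionary.get(term)
        if j is not None:
            vec[j] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def triplets(docs, min_docs_words, stop_words=()):
    """Return one (F_1, F_2, F_3) triplet per document."""
    tokenized = [doc.lower().split() for doc in docs]
    per_space = []
    # Phrase thresholds are one-half and one-quarter of the word threshold.
    for n, threshold in ((1, min_docs_words),
                         (2, min_docs_words / 2),
                         (3, min_docs_words / 4)):
        terms = [ngrams(tokens, n) for tokens in tokenized]
        dictionary = build_dictionary(terms, threshold, stop_words if n == 1 else ())
        per_space.append([frequency_vector(t, dictionary) for t in terms])
    return list(zip(*per_space))
```

Documents that contain no retained term in a given feature space keep an all-zero vector in that space, which is the situation the counts of non-zero records in equation (5) are meant to accommodate.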

7.3 Newsgroups Data: The following 10 newsgroups were picked out of the
"Newsgroups data" from a newspaper to illustrate the invention:

misc.forsale          sci.crypt             comp.windows.x          comp.sys.mac.hardware
rec.autos             rec.sport.baseball    soc.religion.christian
sci.space             talk.politics.guns    talk.politics.mideast

Each newsgroup contains 1000 documents; after removing empty documents, a total of n = 9961
documents exist. For this data set, the unpruned word (resp. 2-word phrase and 3-word phrase)
dictionary had size 72586 (resp. 429604 and 461132), out of which $f_1$ = 2583 (resp. $f_2$ =
2144 and $f_3$ = 2268) elements that appeared in at least 64 (resp. 32 and 16) documents were
retained. All three feature spaces were highly sparse; on average, after pruning, each document
had only 50 (resp. 7.19 and 8.34) words (resp. 2-word and 3-word phrases). Finally, $n_1$ = n =
9961 (resp. $n_2$ = 8639 and $n_3$ = 4664) documents had at least one word (resp. 2-word phrase
and 3-word phrase). Note that the numbers $n_1$, $n_2$, and $n_3$ are used in equation (5) above.

7.4 The Optimal Weights: FIG. 4 shows, for the Newsgroups data set, a plot of macro-p
versus the objective function $Q_1 \times Q_2 \times Q_3$ for various different weight tuples.
The macro-p value corresponding to the optimal weight tuple is shown using one symbol, and the
others are shown using a different symbol. The negative correlation between macro-p and the
objective function is evident from the plot. Macro-p, macro-r, and micro-p numbers are very close
to each other; thus, to avoid visual clutter, only the macro-p numbers are plotted. In this case,
the set of feasible weights is the triangular region shown in FIG. 3. The k-means algorithm with
k = 10 was run on the Newsgroups data set with 31 different feature weights that are shown using
the symbol • in FIG. 3. The objective function in equation (6) is minimized by the weight tuple
(0.50, 0.25, 0.25).

Further, FIG. 4 roughly shows that as the objective function decreases, macro-p increases. The
objective function corresponding to the optimal weight tuple is plotted using its own symbol; by
definition, this is the left-most point on the plot. It can be seen that the optimal weight tuple
has a smaller macro-p value than only one other weight tuple, namely, (.495, .010, .495).
Although macro-r and micro-p results are not shown, they lead to the same conclusions. Use of the
precision/recall numbers reveals that optimal feature weighting provides good final data
clustering.

As discussed above, it has been assumed that the number of clusters k is given;
however, an important problem that the invention can be used for is to automatically determine
the number of clusters in an adaptive or data-driven fashion using information-theoretic criteria
such as the MDL principle. A computationally efficient gradient descent procedure for
computing the optimal parameter tuple $\alpha^{\dagger}$ can be used. This entails combining the
optimization problems in equation (6) and in equation (3) into a single problem that can be
solved using iterative gradient descent heuristics for the squared-Euclidean distance. In the
method of the invention, the new weighted distortion measure $D^{\alpha}$ has been used in the
k-means algorithm; it may also be possible to use this weighted distortion with a graph-based
algorithm such as the complete link method or with hierarchical agglomerative clustering
algorithms.

In summary, the invention provides a method for obtaining good data clustering by
integrating multiple, heterogeneous feature spaces in the k-means clustering algorithm by
adaptively selecting relative weights assigned to various feature spaces while simultaneously
attaining good separation along all the feature spaces.

Moreover, FIG. 5 illustrates a preferred method of the invention fully described above,
where a method for evaluating and outputting a clustering solution for a plurality of multi-
dimensional data records comprises defining (110) a distortion between two feature vectors as a
weighted sum of distortion measures on components of the feature vectors; clustering (112) the
multi-dimensional data records into k clusters using a convex programming formulation; and
selecting (114) optimal feature weights of the feature vectors by an objective function to produce
a solution of a final clustering that simultaneously minimizes average intra-cluster dispersion and
maximizes average inter-cluster dispersion along all the feature spaces.

While the preferred embodiment of the present invention has been illustrated in detail, it
should be apparent that modifications and adaptations to that embodiment may occur to one
skilled in the art without departing from the spirit or scope of the present invention as set forth in
the following claims.